Abstract

The parsimonious Gaussian mixture models, which exploit an eigenvalue
decomposition of the group covariance matrices of the Gaussian mixture, have
proven successful, in particular in cluster analysis. Their estimation is in
general performed by maximum likelihood estimation and has also been
considered from a parametric Bayesian perspective. We propose new Dirichlet
Process Parsimonious Mixtures (DPPM), which represent a Bayesian
nonparametric formulation of these parsimonious Gaussian mixture models. The
proposed DPPM models make it possible to simultaneously infer the model
parameters, the optimal number of mixture components and the optimal
parsimonious mixture structure from the data. We develop a Gibbs sampling
technique for maximum a posteriori (MAP) estimation of the developed DPPM
models and provide a Bayesian model selection framework by using Bayes
factors. We apply them to cluster simulated data and real data sets, and
compare them to the standard parsimonious mixture models. The obtained
results highlight the effectiveness of the proposed nonparametric
parsimonious mixture models as a good nonparametric alternative to the
parametric parsimonious models.
1. Introduction


Clustering is one of the essential tasks in statistics and machine learning.
Model-based clustering, that is, the clustering approach based on the
parametric finite mixture model [1], is one of the most popular and
successful approaches in cluster analysis [2-4]. The finite mixture model
decomposes the density of the observed data as a weighted sum of a finite
number K of component densities. The model most often used for multivariate
real data is the finite Gaussian mixture model (GMM), in which each mixture
component is Gaussian. This paper focuses on Gaussian mixture modeling for
multivariate real data.

In [3] and [5], the authors developed a parsimonious GMM clustering approach
by exploiting an eigenvalue decomposition of the group covariance matrices
of the GMM components, which provides a wide range of very flexible models
with different clustering criteria. It was also demonstrated in [4] that the
parsimonious mixture model-based clustering framework provides very good
results in density estimation, cluster analysis and discriminant analysis.

In model-based clustering using GMMs, the parameters of the Gaussian mixture
are usually estimated in a maximum likelihood estimation (MLE) framework by
maximizing the observed data likelihood. This is usually performed by the
Expectation-Maximization (EM) algorithm [6, 7] or EM extensions [7]. The
parameters of the parsimonious Gaussian mixture models may also be estimated
in an MLE framework by using the EM algorithm [5].

However, a first issue in the MLE approach using the EM algorithm for normal
mixtures is that it may fail due to singularities or degeneracies, as
highlighted in particular in [8-10]. Bayesian estimation methods for mixture
models have led to intensive research in the field for dealing with the
problems encountered in MLE for mixtures [8, 11-18]. They avoid these
problems by replacing the MLE by the maximum a posteriori (MAP) estimator.
This is achieved by introducing a regularization over the model parameters
via prior parameter distributions, which are assumed to be uniform in the
case of MLE.

The MAP estimation for the Bayesian Gaussian mixture is performed by
maximizing the posterior parameter distribution. This can be performed, in
some situations, by an EM-MAP scheme as in [9, 10], where the authors
proposed an EM algorithm for estimating Bayesian parsimonious Gaussian
mixtures. However, the common estimation approach in the case of Bayesian
mixtures is still the one based on Bayesian sampling such as Markov Chain
Monte Carlo (MCMC), namely Gibbs sampling [8, 11, 15] when the number of
mixture components K is known, or reversible jump MCMC (RJMCMC), introduced
by [19], as in [8, 14]. The flexible eigenvalue decomposition of the group
covariance matrix described previously was also exploited in Bayesian
parsimonious model-based clustering by [15, 16], where the authors used a
Gibbs sampler for the model inference.

For these model-based clustering approaches, the number of mixture
components is usually assumed to be known. Another issue in the finite
mixture model-based clustering approach, including the MLE approach as well
as the MAP approach, is therefore that of selecting the optimal number of
mixture components, that is, the problem of model selection. Model selection
is in general performed through a two-fold strategy by selecting the best
model from pre-established inferred model candidates. For the MLE approach,
the choice of the optimal number of mixture components can be performed via
penalized log-likelihood criteria such as the Bayesian Information Criterion
(BIC) [20], the Akaike Information Criterion (AIC) [21], the Approximate
Weight of Evidence (AWE) criterion [3], or the Integrated Classification
Likelihood criterion (ICL) [22]. For the MAP approach, this can still be
performed via modified penalized log-likelihood criteria such as a modified
version of BIC as in [10], computed for the posterior mode, and more
generally via Bayes factors [23] as in [15] for parsimonious mixtures. Bayes
factors are indeed the natural criterion for model selection and comparison
in the Bayesian framework, and criteria such as BIC and AWE are
approximations of them. There are also Bayesian extensions for mixture
models that analyse mixtures with an unknown number of components, for
example the one in [14] using RJMCMC and the one in [8, 24] using the birth
and death process. They are referred to as fully Bayesian mixture models
[14] as they consider the number of mixture components as a parameter to be
inferred from the data, jointly with the mixture model parameters, based on
the posterior distributions.

However, these standard finite mixture models, including the non-Bayesian
and the Bayesian ones, are parametric and may not be well adapted in the
case of unknown and complex data structure. Recently, the Bayesian
nonparametric (BNP) formulation of mixture models, which goes back to [25]
and [26], has received much attention as a nonparametric alternative for
formulating mixtures. BNP methods [13, 27] have indeed recently become
popular due to their flexible modeling capabilities and advances in
inference techniques, in particular for mixture models, by using MCMC
sampling techniques [28, 29] or variational inference [30]. BNP methods for
clustering [13, 27], including Dirichlet Process Mixtures (DPM) and Chinese
Restaurant Process (CRP) mixtures [25, 26, 31-33], represented as Infinite
Gaussian Mixture Models (IGMM) [29], provide a principled way to overcome
the issues in standard model-based clustering and classical Bayesian
mixtures for clustering. BNP mixtures for clustering are fully Bayesian
approaches that offer a principled alternative to jointly infer the number
of mixture components (i.e., clusters) and the mixture parameters from the
data. By using general processes as priors, they avoid the problem of
singularities and degeneracies of the MLE and simultaneously infer the
optimal number of clusters from the data in a one-fold scheme, rather than
in a two-fold approach as in standard model-based clustering. They also
avoid assuming restricted functional forms and thus allow the complexity
and accuracy of the inferred models to grow as more data is observed. They
also represent a good alternative to the difficult problem of model
selection in parametric mixture models.

In this paper, we propose a new BNP formulation of the Gaussian mixture with
the eigenvalue decomposition of the group covariance matrix of each Gaussian
component, which has proven its flexibility in cluster analysis for the
parametric case [3-5, 15]. A first idea of this approach was presented in
[34]. We develop new Dirichlet Process mixture models with parsimonious
covariance structure, which results in Dirichlet Process Parsimonious
Mixtures (DPPM). They represent a Bayesian nonparametric formulation of
these parsimonious Gaussian mixture models. The proposed DPPM models are
Bayesian parsimonious mixture models with a Dirichlet Process prior. They
thus provide a principled way to overcome the issues encountered in the
parametric Bayesian and non-Bayesian cases, and make it possible to
automatically and simultaneously infer the model parameters and the optimal
model structure from the data, among different models going from the
simplest spherical ones to the more complex standard general one. We develop
a Gibbs sampling technique for maximum a posteriori (MAP) estimation of the
various models and provide a unifying framework for model selection and
model comparison by using Bayes factors, to simultaneously select the
optimal number of mixture components and the best parsimonious mixture
structure. The proposed DPPM are more flexible in terms of modeling and of
their use in clustering, and automatically infer the number of clusters from
the data.

The paper is organized as follows. Section 2 describes and discusses
previous work on model-based clustering. Then, Section 3 presents the
proposed models and the learning technique. In Section 4, we give
experimental results to evaluate the proposed models on simulated data and
real data. Finally, Section 5 is devoted to a discussion and concluding
remarks.

2. Parametric model-based clustering 

Let $X = (x_1, \ldots, x_n)$ be a sample of $n$ i.i.d. observations in
$\mathbb{R}^d$, and let $z = (z_1, \ldots, z_n)$ be the corresponding
unknown cluster labels, where $z_i \in \{1, \ldots, K\}$ represents the
cluster label of the $i$th data point $x_i$, $K$ being the possibly unknown
number of clusters.

2.1. Model-based clustering 

Parametric Gaussian clustering, also called model-based clustering [2, 4], 
is based on the finite GMM [1] in which the probability density function of 
the data is given by: 

$$ p(x_i|\theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i|\theta_k) \tag{1} $$

where the $\pi_k$'s are the non-negative mixing proportions that sum to one,
$\theta_k = (\mu_k, \Sigma_k)$ are respectively the mean vector and the
covariance matrix for the $k$th Gaussian component density and
$\theta = (\pi_1, \ldots, \pi_K, \mu_1, \ldots, \mu_K, \Sigma_1, \ldots, \Sigma_K)$
is the GMM parameter vector. From a generative point of view, the generative
process of the data for the finite mixture model in this case can be stated
as follows. First, a mixture component $z_i$ is sampled independently from a
Multinomial distribution given the mixing proportions
$\pi = (\pi_1, \ldots, \pi_K)$. Then, given the mixture component $z_i = k$
and the corresponding parameters $\theta_{z_i}$, the data $x_i$ are generated
independently from a Gaussian with the parameters $\theta_k$ of component
$k$. This is summarized by the two steps:

$$ z_i \sim \text{Mult}(\pi) \tag{2} $$

$$ x_i|\theta_{z_i} \sim \mathcal{N}(x_i|\theta_{z_i}). \tag{3} $$
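
The two generative steps (2) and (3) can be illustrated with a minimal Python sketch. It is only an illustration under simplifying assumptions: the data are univariate, and the function name `sample_gmm` and its arguments are hypothetical and do not come from the paper.

```python
import random


def sample_gmm(n, weights, means, sds, seed=0):
    """Draw n points from a univariate Gaussian mixture (illustrative sketch).

    Follows steps (2)-(3): z_i ~ Mult(pi), then x_i ~ N(mean_{z_i}, sd_{z_i}^2).
    """
    rng = random.Random(seed)
    labels, data = [], []
    for _ in range(n):
        # Step (2): pick a component according to the mixing proportions.
        z = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        # Step (3): draw the observation from the selected Gaussian component.
        x = rng.gauss(means[z], sds[z])
        labels.append(z)
        data.append(x)
    return labels, data


labels, data = sample_gmm(500, weights=[0.3, 0.7], means=[-2.0, 3.0], sds=[0.5, 1.0])
```

The labels are returned together with the data only so that the latent structure can be inspected; in clustering, the labels are the unknown quantities to be inferred.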

The mixture model parameters $\theta$ can be estimated in a maximum
likelihood estimation (MLE) framework by maximizing the observed data
likelihood (4):

$$ p(X|\theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i|\theta_k). \tag{4} $$

The maximum likelihood estimation usually relies on the
Expectation-Maximization (EM) algorithm [6, 7] or EM extensions [7].
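
As an illustration of the quantity being maximized, the following sketch evaluates the logarithm of the likelihood (4) for a univariate mixture. The univariate setting, the helper names and the use of the log-sum-exp device are assumptions made for this example; they are not taken from the paper.

```python
import math


def log_normal_pdf(x, mean, sd):
    """Log-density of a univariate Gaussian N(mean, sd^2) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sd) - 0.5 * ((x - mean) / sd) ** 2


def gmm_log_likelihood(data, weights, means, sds):
    """Log of the observed-data likelihood (4) for a univariate GMM."""
    total = 0.0
    for x in data:
        # log(pi_k) + log N(x | theta_k) for each component with non-zero weight
        terms = [math.log(w) + log_normal_pdf(x, m, s)
                 for w, m, s in zip(weights, means, sds) if w > 0.0]
        # log-sum-exp keeps the sum over components numerically stable
        top = max(terms)
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total
```

Working on the log scale turns the product over observations in (4) into a sum, which avoids numerical underflow for large n.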
2.2. Bayesian model-based clustering

As stated in Section 1, the MLE approach using the EM algorithm for normal
mixtures may fail due to singularities or degeneracies [8-10]. The Bayesian
approach to mixture models avoids the problems associated with the maximum
likelihood approach described previously. The parameter estimation for the
Gaussian mixture model in the Bayesian approach is performed in a MAP
estimation framework by maximizing the posterior parameter distribution

$$ p(\theta|X) \propto p(\theta) \, p(X|\theta), \tag{5} $$

$p(\theta)$ being a chosen prior distribution over the model parameters
$\theta$. The prior distribution in general takes the following form for the
GMM:

$$ p(\theta) = p(\pi|\alpha) \, p(\mu|\Sigma, \mu_0, \kappa_0) \, p(\Sigma|\Lambda_0, \nu_0) = \prod_{k=1}^{K} p(\pi_k|\alpha) \, p(\mu_k|\Sigma_k) \, p(\Sigma_k), \tag{6} $$

where $(\alpha, \mu_0, \kappa_0, \Lambda_0, \nu_0)$ are hyperparameters. A
common choice for the GMM is to assume conjugate priors, that is, a Dirichlet
distribution for the mixing proportions $\pi$ as in [14, 35], and a
multivariate normal Inverse-Wishart prior distribution for the Gaussian
parameters, that is, a multivariate normal for the means $\mu$ and an
Inverse-Wishart for the covariance matrices $\Sigma$, for example as in
[9, 10, 15].

From a generative point of view, to generate data from the Bayesian GMM, a
first step is to sample the model parameters from the prior, that is, to
sample the mixing proportions from their conjugate Dirichlet prior
distribution, and the mean vectors and the covariance matrices of the
Gaussian components from the corresponding conjugate multivariate normal
Inverse-Wishart prior. The rest of the generative procedure remains the same
as in the previously described generative process, and is summarized by the
following steps:

$$ \pi|\alpha \sim \text{Dir}\left(\frac{\alpha_1}{K}, \ldots, \frac{\alpha_K}{K}\right) \tag{7} $$

$$ z_i|\pi \sim \text{Mult}(\pi) \tag{8} $$

$$ \theta_{z_i}|G_0 \sim G_0 \tag{9} $$

$$ x_i|\theta_{z_i} \sim \mathcal{N}(x_i|\theta_{z_i}) \tag{10} $$


where $\alpha$ are hyperparameters of the Dirichlet prior distribution, and
$G_0$ is a prior distribution for the parameters of the Gaussian component,
that is, a multivariate Normal Inverse-Wishart distribution for the GMM case:

$$ \Sigma_k \sim \mathcal{IW}(\nu_0, \Lambda_0) \tag{11} $$

$$ \mu_k|\Sigma_k \sim \mathcal{N}\left(\mu_0, \frac{\Sigma_k}{\kappa_0}\right) \tag{12} $$

where $\mathcal{IW}$ stands for the Inverse-Wishart distribution.
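
The generative steps (7) to (12) can be sketched in Python for the univariate case. This is an illustration under stated assumptions rather than the paper's implementation: a symmetric Dirichlet hyperparameter is used, the Inverse-Wishart prior is replaced by its one-dimensional analogue (an Inverse-Gamma prior on the variance with hypothetical hyperparameters `a0` and `b0`), and all names are chosen for this example.

```python
import random


def sample_bayesian_gmm(n, K, alpha=1.0, mu0=0.0, kappa0=0.1, a0=3.0, b0=2.0, seed=0):
    """Sample parameters from the prior, then data (univariate sketch)."""
    rng = random.Random(seed)
    # (7): pi ~ Dir(alpha/K, ..., alpha/K), obtained by normalising Gamma draws
    g = [rng.gammavariate(alpha / K, 1.0) for _ in range(K)]
    pi = [v / sum(g) for v in g]
    # (11), univariate analogue: sigma2_k ~ Inverse-Gamma(a0, b0),
    # sampled as b0 divided by a Gamma(a0, 1) draw
    sigma2 = [b0 / rng.gammavariate(a0, 1.0) for _ in range(K)]
    # (12): mu_k | sigma2_k ~ N(mu0, sigma2_k / kappa0)
    mu = [rng.gauss(mu0, (s2 / kappa0) ** 0.5) for s2 in sigma2]
    # (8): z_i | pi ~ Mult(pi)
    z = rng.choices(range(K), weights=pi, k=n)
    # (10): x_i | theta_{z_i} ~ N(mu_{z_i}, sigma2_{z_i})
    x = [rng.gauss(mu[k], sigma2[k] ** 0.5) for k in z]
    return pi, mu, sigma2, z, x
```

A small `kappa0` spreads the component means widely relative to the within-component variance, which tends to produce well-separated clusters.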

The parameters $\theta$ of the Bayesian Gaussian mixture are estimated by
MAP estimation by maximizing the posterior parameter distribution (5). The
MAP estimation can still be performed by EM, namely in the case of conjugate
priors where the prior distribution is only considered for the parameters of
the Gaussian components, as in [9, 10]. However, in general, the common
estimation approach in the case of the Bayesian GMM described above is the
one using Bayesian sampling such as MCMC sampling techniques, namely the
Gibbs sampler [8, 11, 13, 15, 35-37].

2.3. Parsimonious Gaussian mixture models 

The GMM clustering has been extended to parsimonious GMM cluster- 
ing [3, 5] by exploiting an eigenvalue decomposition of the group covariance 
matrices, which provides a wide range of very flexible models with differ- 
ent clustering criteria. In these parsimonious models, the group covariance 
matrix for each cluster k is decomposed as 

$$ \Sigma_k = \lambda_k D_k A_k D_k^T \tag{13} $$

where $\lambda_k = |\Sigma_k|^{1/d}$, $D_k$ is an orthogonal matrix of
eigenvectors of $\Sigma_k$ and $A_k$ is a diagonal matrix with determinant 1
whose diagonal elements are the normalized eigenvalues of $\Sigma_k$ in
decreasing order. As pointed out in [5], the scalar $\lambda_k$ determines
the volume of cluster $k$, $D_k$ its orientation and $A_k$ its shape. Thus,
this decomposition leads to several flexible models [5], going from the
simplest spherical models to the complex general one, and hence is adapted
to various clustering situations.
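
To make the roles of volume, shape and orientation in (13) concrete, the following sketch builds a covariance matrix for d = 2. The parametrization of $D_k$ by a rotation angle and of $A_k$ by a single ratio is an assumption that is specific to two dimensions, and the function name is hypothetical.

```python
import math


def parsimonious_cov_2d(volume, shape_ratio, angle):
    """Build Sigma = lambda * D * A * D^T for d = 2 (illustrative sketch).

    volume      : lambda > 0
    shape_ratio : a >= 1, giving A = diag(a, 1/a), which has determinant 1
    angle       : rotation angle (radians) defining the orthogonal matrix D
    """
    a1, a2 = shape_ratio, 1.0 / shape_ratio
    c, s = math.cos(angle), math.sin(angle)
    # With D = [[c, -s], [s, c]], the product D A D^T is written out entrywise.
    s11 = volume * (a1 * c * c + a2 * s * s)
    s12 = volume * (a1 - a2) * c * s
    s22 = volume * (a1 * s * s + a2 * c * c)
    return [[s11, s12], [s12, s22]]


sigma = parsimonious_cov_2d(volume=2.0, shape_ratio=3.0, angle=0.7)
det = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]
```

Since $A$ has determinant 1 and $D$ is orthogonal, `det` equals `volume ** 2` up to rounding, which agrees with $\lambda_k = |\Sigma_k|^{1/d}$. Setting `shape_ratio=1.0` gives the spherical model $\lambda I$ regardless of the angle.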

The parameters $\theta$ of the parsimonious Gaussian mixture models are
estimated in an MLE framework by using the EM algorithm. The details of the
EM algorithm for the different parsimonious finite GMMs are given in [5].
The parsimonious GMMs have also received much attention from the Bayesian
perspective. For example, in [15], the authors proposed a fully Bayesian
formulation for inferring the previously described parsimonious finite
Gaussian mixture models. This Bayesian formulation was applied in
model-based cluster analysis [15, 16]. The model inference in this Bayesian
formulation is performed in a MAP estimation framework by using MCMC
sampling techniques, see for example [15, 16]. Another Bayesian
regularization for the parsimonious GMM was proposed by [9, 10], in which the
maximization of the posterior can still be performed by the EM algorithm in
the MAP framework.

2.4. Model selection in finite mixture models

Finite mixture model-based clustering requires specifying the number of
mixture components (i.e., clusters) and, in the case of parsimonious models,
the type of the model. The main issue in this parametric model is therefore
that of selecting the number of mixture components (clusters), and possibly
the type of the model, that best fit the data. This problem can be tackled
by penalized log-likelihood criteria such as BIC [20], by penalized
classification log-likelihood criteria such as AWE [3] or ICL [22], or more
generally by using Bayes factors [23], which provide a general way to select
and compare models in (Bayesian) statistical modeling, namely in Bayesian
mixture models.

3. Dirichlet Process Parsimonious Mixture (DPPM) 

However, the Bayesian and non-Bayesian finite mixture models described
previously are in general parametric and may not be well adapted to
represent complex and realistic data sets. Recently, the Bayesian
nonparametric (BNP) mixtures, in particular the Dirichlet Process Mixture
(DPM) [25, 26, 32, 33] or by equivalence the Chinese Restaurant Process
(CRP) mixture [33, 38, 39], which may be seen as an infinite mixture model
[29], provide a principled way to overcome the issues in standard
model-based clustering and classical Bayesian mixtures for clustering. They
are fully Bayesian approaches and offer a principled alternative to jointly
infer the number of mixture components (i.e., clusters) and the mixture
parameters from the data. In the next section, we rely on the Dirichlet
Process Mixture (DPM) formulation to derive the proposed approach.

BNP mixture approaches for clustering assume a general process as a prior on
the infinite possible partitions, which is not as restrictive as in
classical Bayesian inference. Such a prior can be a Dirichlet Process
[25, 26, 33] or, by equivalence, a Chinese Restaurant Process [33, 39].
3.1. Dirichlet Process Parsimonious Mixture

A Dirichlet Process (DP) [25] is a distribution over distributions and has
two parameters, the concentration parameter $\alpha > 0$ and the base
measure $G_0$. We denote it by $\text{DP}(\alpha, G_0)$. Assume there is a
parameter $\tilde{\theta}_i$ following a distribution $G$, that is
$\tilde{\theta}_i|G \sim G$. Modeling with a DP means that we assume that the
prior over $G$ is a DP, that is, $G$ is itself generated from a DP:
$G \sim \text{DP}(\alpha, G_0)$. This can be summarized by the following
generative process:

$$ \tilde{\theta}_i|G \sim G, \quad \forall i \in 1, \ldots, n \tag{14} $$

$$ G|\alpha, G_0 \sim \text{DP}(\alpha, G_0). \tag{15} $$

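The DP has two properties [25]. First, random distributions drawn from a DP,
that is $G \sim \text{DP}(\alpha, G_0)$, are discrete. Thus, there is a
strictly positive probability of multiple observations $\tilde{\theta}_i$
taking identical values within the set
$(\tilde{\theta}_1, \ldots, \tilde{\theta}_n)$. Suppose we have a random
distribution $G$ drawn from a DP, followed by repeated draws
$(\tilde{\theta}_1, \ldots, \tilde{\theta}_n)$ from that random distribution.
[40] introduced a Polya urn representation of the joint distribution of the
random variables $(\tilde{\theta}_1, \ldots, \tilde{\theta}_n)$, that is

$$ p(\tilde{\theta}_1, \ldots, \tilde{\theta}_n) = p(\tilde{\theta}_1) \, p(\tilde{\theta}_2|\tilde{\theta}_1) \, p(\tilde{\theta}_3|\tilde{\theta}_1, \tilde{\theta}_2) \cdots p(\tilde{\theta}_n|\tilde{\theta}_1, \tilde{\theta}_2, \ldots, \tilde{\theta}_{n-1}), \tag{16} $$

which is obtained by marginalizing out the underlying random measure $G$:

$$ p(\tilde{\theta}_1, \ldots, \tilde{\theta}_n|\alpha, G_0) = \int \left( \prod_{i=1}^{n} p(\tilde{\theta}_i|G) \right) dp(G|\alpha, G_0) \tag{17} $$

and results in the following Polya urn representation for the calculation of
the predictive terms of the joint distribution (16):

$$ \tilde{\theta}_i|\tilde{\theta}_1, \ldots, \tilde{\theta}_{i-1} \sim \frac{\alpha}{\alpha + i - 1} G_0 + \sum_{j=1}^{i-1} \frac{1}{\alpha + i - 1} \delta_{\tilde{\theta}_j} \tag{18} $$

$$ \sim \frac{\alpha}{\alpha + i - 1} G_0 + \sum_{k=1}^{K_{i-1}} \frac{n_k}{\alpha + i - 1} \delta_{\theta_k} \tag{19} $$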


where $K_{i-1}$ is the number of clusters after $i - 1$ samples and $n_k$
denotes the number of times each of the parameters $\theta_k$,
$k = 1, \ldots, K_{i-1}$, occurred in the set
$(\tilde{\theta}_1, \ldots, \tilde{\theta}_{i-1})$. The DP therefore places
its probability mass on a countably infinite collection of points, also
called atoms, that is, an infinite mixture of Dirac deltas [25, 33, 41]:

$$ G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k}, \quad \theta_k|G_0 \sim G_0, \quad k = 1, 2, \ldots, \tag{20} $$

where $\pi_k$ represents the probability assigned to the $k$th atom, with
$\sum_{k=1}^{\infty} \pi_k = 1$, and $\theta_k$ is the location or value of
that component (atom). These atoms are drawn independently from the base
measure $G_0$. Hence, according to the DP, the generated parameters
$\tilde{\theta}_i$ exhibit a clustering property, that is, they share
repeated values with positive probability, where the unique values of
$\tilde{\theta}_i$ shared among the variables are independent draws from the
base distribution $G_0$ [25, 33]. The Dirichlet process therefore provides a
very interesting approach from a clustering perspective when we do not have
a fixed number of clusters, in other words when we have an infinite mixture
in which $K$ tends to infinity. Consider a set of observations
$(x_1, \ldots, x_n)$ to be clustered. Clustering with the DP adds a third
step to the DP (15): we assume that the random variables $x_i$, given the
distribution parameters $\tilde{\theta}_i$ which are generated from a DP, are
generated from a distribution $f(\cdot|\tilde{\theta}_i)$. This is the DP
Mixture model (DPM) [26, 32, 33, 42], whose generative process is as follows:


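$$ G|\alpha, G_0 \sim \text{DP}(\alpha, G_0) \tag{21} $$

$$ \tilde{\theta}_i|G \sim G \tag{22} $$

$$ x_i|\tilde{\theta}_i \sim f(x_i|\tilde{\theta}_i) \tag{23} $$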


where $f(x_i|\tilde{\theta}_i)$ is a cluster-specific density, for example a
multivariate Gaussian density in the case of the DP multivariate Gaussian
mixture, where $\tilde{\theta}_i$ is composed of a mean vector and a
covariance matrix. In that case, $G_0$ may be a multivariate normal
Inverse-Wishart conjugate prior. When $K$ tends to infinity, it can be shown
that the finite mixture model (1) converges to a Dirichlet process mixture
model [28, 29, 43]. The Dirichlet process has a number of properties which
make inference based on this nonparametric prior computationally tractable.
It has an interpretation in terms of the CRP mixture [33, 39]. Indeed, the
second property of the DP, that is, the fact that random parameters drawn
from a DP exhibit a clustering property, connects the DP to the CRP.
Consider a random distribution drawn from a DP,
$G \sim \text{DP}(\alpha, G_0)$, followed by repeated draws from that random
distribution, $\tilde{\theta}_i \sim G$, $\forall i \in 1, \ldots, n$. The
structure of shared values defines a partition of the integers from 1 to
$n$, and the distribution of this partition is a CRP [25, 33].

3.2. Chinese Restaurant Process (CRP) parsimonious mixture 

Consider the unknown cluster labels $z = (z_1, \ldots, z_n)$, where each
value $z_i$ is an indicator random variable that represents the label of the
unique value $\theta_{z_i}$ of $\tilde{\theta}_i$, such that
$\tilde{\theta}_i = \theta_{z_i}$ for all $i \in \{1, \ldots, n\}$. The CRP
provides a distribution on the infinite partitions of the data, that is, a
distribution over the positive integers $1, \ldots, n$. Consider the
following joint distribution of the unknown cluster assignments
$(z_1, \ldots, z_n)$:

$$ p(z_1, \ldots, z_n) = p(z_1) \, p(z_2|z_1) \cdots p(z_n|z_1, z_2, \ldots, z_{n-1}). \tag{24} $$

From the Polya urn distribution (19), each predictive term of the joint
distribution (24) is given by:

$$ p(z_i = k|z_1, \ldots, z_{i-1}; \alpha) = \frac{\alpha}{\alpha + i - 1} \delta(z_i, K_{i-1} + 1) + \sum_{k=1}^{K_{i-1}} \frac{n_k}{\alpha + i - 1} \delta(z_i, k) \tag{25} $$

where $n_k = \sum_{j=1}^{i-1} \delta(z_j, k)$ is the number of indicator
random variables taking the value $k$, and $K_{i-1} + 1$ is the previously
unseen value. From this distribution, one can therefore assign new data to
possibly previously unseen (new) clusters as the data are observed, after
starting with one cluster. The distribution on partitions induced by the
sequence of conditional distributions in Eq. (25) is commonly referred to as
the Chinese Restaurant Process (CRP). It can be interpreted as follows.
Suppose there is a restaurant with an infinite number of tables, in which
customers are entering and sitting at tables. We assume that customers are
social, so that the $i$th customer sits at table $k$ with probability
proportional to the number of already seated customers $n_k$
($k \le K_{i-1}$ being a previously occupied table), and may choose a new
table ($k > K_{i-1}$, $k$ being a new table to be occupied) with a
probability proportional to a small positive real number $\alpha$, which
represents the CRP concentration parameter.
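
The seating rule (25) can be sketched as follows. This is a minimal illustration in Python; the function names are hypothetical, tables are indexed from 0, and the last entry of the returned probability vector corresponds to opening a new table.

```python
import random


def crp_predictive(counts, alpha):
    """Seating probabilities (25) for the next customer.

    counts : list of n_k, the number of customers at each occupied table
    alpha  : concentration parameter
    Returns the probabilities of the occupied tables followed by the
    probability of a new table.
    """
    total = alpha + sum(counts)  # equals alpha + i - 1 for the i-th customer
    return [n_k / total for n_k in counts] + [alpha / total]


def sample_crp(n, alpha, seed=0):
    """Seat n customers sequentially and return their table labels and counts."""
    rng = random.Random(seed)
    counts, labels = [], []
    for _ in range(n):
        probs = crp_predictive(counts, alpha)
        k = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        if k == len(counts):
            counts.append(1)   # a new table is opened
        else:
            counts[k] += 1     # join an already occupied table
        labels.append(k)
    return labels, counts


labels, counts = sample_crp(100, alpha=1.0)
```

With table counts `[2, 1]` and `alpha = 1.0`, `crp_predictive` returns `[0.5, 0.25, 0.25]`: larger tables attract more customers, which is the clustering property described above.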

In clustering with the CRP, customers correspond to data points and tables
correspond to clusters. In the CRP mixture, the prior
$\text{CRP}(z_1, \ldots, z_{i-1}; \alpha)$ is completed with a likelihood
with parameters $\theta_k$ associated with each table (cluster) $k$ (i.e., a
multivariate Gaussian likelihood with mean vector and covariance matrix in
the GMM case), and a prior distribution ($G_0$) for the parameters. For
example, in the GMM case, one can use a conjugate multivariate normal
Inverse-Wishart prior distribution for the mean vectors and the covariance
matrices. This corresponds to the $i$th customer, who sits at table
$z_i = k$, choosing a dish (the parameter $\theta_{z_i}$) from the prior of
that table (cluster). The CRP mixture can be summarized according to the
following generative process:

$$ z_i|z_1, \ldots, z_{i-1} \sim \text{CRP}(z_1, \ldots, z_{i-1}; \alpha) \tag{26} $$

$$ \theta_{z_i}|G_0 \sim G_0 \tag{27} $$

$$ x_i|\theta_{z_i} \sim f(\cdot|\theta_{z_i}) \tag{28} $$



where the CRP distribution is given by Eq. (25), $G_0$ is a base measure
(the prior distribution) and $f(x_i|\theta_{z_i})$ is a cluster-specific
density. In the DPM and CRP mixtures with multivariate Gaussian components,
the parameters $\theta$ of each cluster density are composed of a mean
vector and a covariance matrix. In that case, a common base measure $G_0$ is
a multivariate normal Inverse-Wishart conjugate prior.
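
The generative process (26) to (28) can be sketched in Python for univariate data. The choices below are assumptions made only for this illustration and differ from the multivariate setting of the paper: the base measure $G_0$ is taken to be a Gaussian prior on the cluster mean, and all clusters share a known standard deviation.

```python
import random


def sample_crp_mixture(n, alpha, mu0=0.0, tau0=5.0, sd=0.5, seed=0):
    """Generate labels, cluster means and data from a CRP mixture (sketch).

    Assumed for illustration: G_0 = N(mu0, tau0^2) on the cluster mean and a
    common, known within-cluster standard deviation sd.
    """
    rng = random.Random(seed)
    counts, thetas, z, x = [], [], [], []
    for _ in range(n):
        # (26): an occupied table k is chosen with weight n_k, a new table
        # with weight alpha (rng.choices normalises the weights)
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if k == len(counts):
            counts.append(0)
            # (27): the parameter of a new cluster is drawn from G_0
            thetas.append(rng.gauss(mu0, tau0))
        counts[k] += 1
        z.append(k)
        # (28): the observation is drawn from the cluster-specific density
        x.append(rng.gauss(thetas[k], sd))
    return z, thetas, x


z, thetas, x = sample_crp_mixture(200, alpha=1.0)
```

The number of clusters, `len(thetas)`, is not fixed in advance; it is a by-product of the seating process and tends to grow slowly with n.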

We note that in the proposed DP parsimonious mixture, or by equivalence, CRP
parsimonious mixture, the cluster covariance matrices are parametrized in
terms of an eigenvalue decomposition to provide more flexible clusters with
possibly different volumes, shapes and orientations. In terms of a CRP
interpretation, this can be seen as a variability of dishes for each table
(cluster).

We indeed use the eigenvalue decomposition described in Section 2.3, which
until now has been considered only in the case of parametric finite mixture
model-based clustering (e.g., see [3, 5]) and Bayesian parametric finite
mixture model-based clustering (e.g., see [9, 10, 15, 16]). We investigate
eight parsimonious models, covering the three families of the mixture
models: the general, the diagonal and the spherical family. The parsimonious
models therefore go from the simplest spherical one to the more general full
model. Table 1 summarizes the considered parsimonious Gaussian mixture
models, the corresponding prior distribution for each model and the
corresponding number of free parameters for a mixture model with $K$
components for data in dimension $d$. We used conjugate priors, that is, a
Dirichlet distribution for the mixing proportions $\pi$ [14, 35], a
multivariate Normal for the mean vector $\mu$, and an Inverse-Wishart or an
Inverse-Gamma prior, depending on the parsimonious model, for the covariance
matrix $\Sigma$ [10, 15].
Model                        Type       Prior                                Applied to                              # free parameters
$\lambda I$                  Spherical  $\mathcal{IG}$                       $\lambda$                               $\upsilon + 1$
$\lambda_k I$                Spherical  $\mathcal{IG}$                       $\lambda_k$                             $\upsilon + d$
$\lambda A$                  Diagonal   $\mathcal{IG}$                       diag. elements of $\lambda A$           $\upsilon + d$
$\lambda_k A$                Diagonal   $\mathcal{IG}$                       diag. elements of $\lambda_k A$         $\upsilon + d + K - 1$
$\lambda D A D^T$            General    $\mathcal{IW}$                       $\Sigma = \lambda D A D^T$              $\upsilon + \omega$
$\lambda_k D A D^T$          General    $\mathcal{IG}$ and $\mathcal{IW}$    $\lambda_k$ and $\Sigma = D A D^T$      $\upsilon + \omega + K - 1$
$\lambda D_k A D_k^T$        General    $\mathcal{IG}$                       diag. elements of $\lambda A$           $\upsilon + K\omega - (K - 1)d$
$\lambda_k D_k A D_k^T$      General    $\mathcal{IG}$                       diag. elements of $\lambda_k A$         $\upsilon + K\omega - (K - 1)(d - 1)$
$\lambda_k D_k A_k D_k^T$    General    $\mathcal{IW}$                       $\Sigma_k = \lambda_k D_k A_k D_k^T$    $\upsilon + K\omega$

Table 1: Considered parsimonious models via eigenvalue decomposition, the
associated prior for the covariance structure and the corresponding number
of free parameters, where $\mathcal{I}$ denotes an inverse distribution,
$\mathcal{G}$ a Gamma distribution and $\mathcal{W}$ a Wishart distribution,
$\upsilon = (K - 1) + Kd$ and $\omega = d(d + 1)/2$, $K$ being the number of
mixture components and $d$ the number of variables for each individual.


3.3. Bayesian learning via Gibbs sampling

Given $n$ observations $X = (x_1, \ldots, x_n)$ modeled by the proposed
Dirichlet process parsimonious mixture (DPPM), the aim is to infer the
number $K$ of latent clusters underlying the observed data, their parameters
$\Theta = (\theta_1, \ldots, \theta_K)$ and the latent cluster labels
$z = (z_1, \ldots, z_n)$. We developed an MCMC Gibbs sampling technique, as
in [28, 29, 32], to learn the proposed Bayesian nonparametric parsimonious
mixture models.

The Gibbs sampler for mixtures proceeds iteratively as follows. Given
initial mixture parameters $\theta^{(0)}$ and a prior over the missing
labels $z$ (here a conjugate Dirichlet prior), the Gibbs sampler, instead of
estimating the missing labels $z^{(t)}$, simulates them from their posterior
distribution $p(z|X, \theta^{(t)})$ at each iteration $t$, which is in this
case a Multinomial distribution whose parameters are the posterior class
probabilities. Then, given the completed data and the prior distribution
$p(\theta)$ over the mixture parameters, the Gibbs sampler generates the
mixture parameters $\theta^{(t+1)}$ from the corresponding posterior
distribution $p(\theta|X, z^{(t)})$, which is in this case a multivariate
Normal Inverse-Wishart or a Normal-Inverse-Gamma distribution, depending on
the parsimonious model. This Bayesian sampling procedure produces an ergodic
Markov chain of samples $(\theta^{(t)})$ with stationary distribution
$p(\theta|X)$. Therefore, after $M$ initial burn-in steps out of $N$ Gibbs
samples, the variables $(\theta^{(M+1)}, \ldots, \theta^{(N)})$ can be
considered to be approximately distributed according to the posterior
distribution $p(\theta|X)$. The Gibbs sampler consists in sampling the
couple $(\Theta, z)$ from their posterior distribution. The posterior
distribution for $\theta_k$ given all the other variables is given by

$$ p(\theta_k|z, X, \Theta_{-k}, \pi, \alpha; H) \propto \prod_{i|z_i = k} f(x_i|z_i = k; \theta_k) \, p(\theta_k; H) \tag{29} $$

where $\Theta_{-k} = (\theta_1, \ldots, \theta_{k-1}, \theta_{k+1}, \ldots, \theta_{K_{i-1}})$
and $p(\theta_k; H)$ is the prior distribution for $\theta_k$, that is
$G_0$, with $H$ being the hyperparameters of the model. The cluster labels
$z_i$ are similarly sampled from their posterior distribution, which is
given by

$$ p(z_i = k|z_{-i}, X, \Theta, \pi, \alpha) \propto f(x_i|z_i; \Theta) \, p(z_i|z_{-i}; \alpha) \tag{30} $$

where $z_{-i} = (z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_n)$, and
$p(z_i|z_{-i}; \alpha)$ is the prior predictive distribution, which
corresponds to the CRP distribution computed as in Equation (25). The prior
distribution, and the resulting posterior distribution, for each of the
considered models are close to those in [15] and are provided in detail in
the supplementary material.
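
The label update (30) combines the likelihood of the observation under each cluster with the CRP prior (25). The sketch below computes the resulting normalized probabilities for univariate Gaussian clusters. It relies on assumptions that go beyond the text: the clusters are univariate, and the candidate parameters of a possible new cluster (`new_mean`, `new_sd`) are supposed to have been drawn from $G_0$ beforehand. The function names are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a univariate Gaussian N(mean, sd^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def label_posterior(x_i, means, sds, counts, alpha, new_mean, new_sd):
    """Normalised probabilities (30) over the existing clusters plus a new one.

    counts             : n_k for each existing cluster, excluding observation i
    (new_mean, new_sd) : candidate parameters for a new cluster, assumed to
                         have been drawn from the base measure G_0
    The common factor 1 / (alpha + n - 1) of the CRP prior cancels out.
    """
    weights = [n_k * normal_pdf(x_i, m, s) for n_k, m, s in zip(counts, means, sds)]
    weights.append(alpha * normal_pdf(x_i, new_mean, new_sd))
    total = sum(weights)
    return [w / total for w in weights]


probs = label_posterior(0.2, means=[0.0, 4.0], sds=[1.0, 1.0], counts=[5, 3],
                        alpha=1.0, new_mean=-3.0, new_sd=1.0)
```

A label is then drawn from `probs`; the last index stands for the creation of a new cluster. For observations far from every cluster, the densities may underflow, so an implementation would typically work with log-densities.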

3.3.1. Sampling the hyperparameter $\alpha$ of the DPPM

The number of mixture components in the models depends on the hyperparameter
$\alpha$ of the Dirichlet Process [26]. We therefore choose to sample it to
avoid fixing an arbitrary value for it. We follow the method introduced by
[12], which consists in sampling $\alpha$ by assuming a prior Gamma
distribution $\alpha \sim \mathcal{G}(a, b)$ with a shape hyperparameter
$a > 0$ and scale hyperparameter $b > 0$. Then, a variable $\eta$ is
introduced and sampled conditionally on $\alpha$ and the number of clusters
$K_{i-1}$, according to a Beta distribution
$\eta|\alpha, K_{i-1} \sim \mathcal{B}(\alpha + 1, n)$. The resulting
posterior distribution for the hyperparameter $\alpha$ is given by:

$$ p(\alpha|\eta, K) \sim \vartheta_\eta \, \mathcal{G}(a + K_{i-1}, b - \log(\eta)) + (1 - \vartheta_\eta) \, \mathcal{G}(a + K_{i-1} - 1, b - \log(\eta)) \tag{31} $$

where the weights are
$\vartheta_\eta = \frac{a + K_{i-1} - 1}{a + K_{i-1} - 1 + n(b - \log(\eta))}$.
The developed Gibbs sampler is summarized by the pseudo-code in Algorithm 1.
The retained solution is the one corresponding to the posterior mode of the
number of mixture components, that is, the one that appears the most
frequently during the sampling.
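
One update of $\alpha$ following (31) can be sketched with the Python standard library. Several points are assumptions of this illustration: the second argument of the Gamma components is treated as an inverse scale (rate), as in the construction of [12], so that `random.gammavariate`, which takes a shape and a scale, receives its reciprocal; the mixture weight is computed in the normalized form of [12]; and the function name and default hyperparameter values are hypothetical.

```python
import math
import random


def sample_alpha(alpha, K, n, a=1.0, b=1.0, rng=None):
    """One update of the concentration parameter following (31) (sketch).

    alpha : current value of the concentration parameter
    K     : current number of clusters
    n     : number of observations
    a, b  : hyperparameters of the Gamma prior on alpha (b used as a rate)
    """
    rng = rng or random
    # Auxiliary variable: eta | alpha ~ Beta(alpha + 1, n)
    eta = rng.betavariate(alpha + 1.0, n)
    rate = b - math.log(eta)  # larger than b because 0 < eta < 1
    # Weight of the first Gamma component in the two-component mixture
    weight = (a + K - 1.0) / (a + K - 1.0 + n * rate)
    shape = a + K if rng.random() < weight else a + K - 1.0
    # random.gammavariate is parametrised by (shape, scale), hence 1 / rate
    return rng.gammavariate(shape, 1.0 / rate)


alpha = 1.0
for _ in range(10):
    alpha = sample_alpha(alpha, K=3, n=50)
```

In a full sampler, this update would be called once per sweep with the current number of clusters, so that $\alpha$ adapts to the data instead of being fixed in advance.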

3.4. Bayesian model selection and comparison via Bayes factors

This section describes the strategy used for model selection and comparison, that is, the choice of the number of mixture components (clusters) for
Algorithm 1 Gibbs sampling for the proposed IPGMM
Inputs: Data set $(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ and number of Gibbs samples $n_s$
Initialize the model hyperparameters $H$.
Start with one cluster $K_1 = 1$, $\theta_1 = \{\mu_1, \Sigma_1\}$
for $t = 2, \ldots, n_s$ do
    for $i = 1, \ldots, n$ do
        for $k = 1, \ldots, K_{t-1}$ do
            if $(n_k = \sum_{i=1}^{n} z_{ik}) - 1 = 0$ then
                Decrease $K_{t-1} = K_{t-1} - 1$; let $\{\theta^{(t)}\} \leftarrow \{\theta^{(t)}\} \setminus \theta_k$
            end if
        end for
        Sample a cluster label $z_i^{(t)}$ from the posterior:
            $p(z_i | \mathbf{z}_{\setminus i}, \mathbf{X}, \theta^{(t)}, H) \propto p(\mathbf{x}_i | z_i, \theta^{(t)})\, \mathrm{CRP}(\mathbf{z}_{\setminus i}; \alpha)$
        if $z_i^{(t)} = K_{t-1} + 1$ then
            Increase $K_{t-1} = K_{t-1} + 1$ (we get a new cluster) and sample a new cluster parameter $\theta_{z_i}^{(t)}$ from the prior distribution as in Table 1
        end if
    end for
    for $k = 1, \ldots, K_{t-1}$ do
        Sample the parameters $\theta_k^{(t)}$ from the posterior distribution.
    end for
    Sample the hyperparameter $\alpha^{(t)} \sim p(\alpha^{(t)} | K_{t-1})$ from the posterior (31)
    $\mathbf{z}^{(t+1)} \leftarrow \mathbf{z}^{(t)}$
end for
Outputs: $\{\hat{\theta}, \hat{\mathbf{z}}, \hat{K} = K_{t-1}\}$


a given model, and the selection of the best model among the different parsimonious models. We use Bayes factors [23], which provide a general way to compare models in (Bayesian) statistical modeling and have been widely studied in the case of mixture models [15, 23, 44-46]. Suppose that we have two candidate models $M_1$ and $M_2$. If we assume that the two models have the same prior probability, $p(M_1) = p(M_2)$, the Bayes factor is given by

$$BF_{12} = \frac{p(\mathbf{X} | M_1)}{p(\mathbf{X} | M_2)} \qquad (32)$$

which corresponds to the ratio between the marginal likelihoods of the two models $M_1$ and $M_2$. It is a summary of the evidence for model $M_1$ against model $M_2$ given the data $\mathbf{X}$. The marginal likelihood $p(\mathbf{X} | M_m)$ for model $M_m$, $m \in \{1, 2\}$, also called the integrated likelihood, is given by

$$p(\mathbf{X} | M_m) = \int p(\mathbf{X} | \theta_m, M_m)\, p(\theta_m | M_m)\, d\theta_m \qquad (33)$$

where $p(\mathbf{X} | \theta_m, M_m)$ is the likelihood of model $M_m$ with parameters $\theta_m$ and $p(\theta_m | M_m)$ is the prior density of the mixture parameters $\theta_m$ for model $M_m$.

As the marginal likelihood (33) is difficult to compute analytically, several approximations have been proposed. One of the most used is the Laplace-Metropolis approximation [47], given by

$$\hat{p}_{Laplace}(\mathbf{X} | M_m) = (2\pi)^{\nu_m / 2}\, |\hat{H}|^{1/2}\, p(\mathbf{X} | \hat{\theta}_m, M_m)\, p(\hat{\theta}_m | M_m) \qquad (34)$$

where $\hat{\theta}_m$ is the posterior estimate of $\theta_m$ (posterior mode) for model $M_m$, $\nu_m$ is the number of free parameters of the mixture model $M_m$ as given in Table 1, and $\hat{H}$ is minus the inverse Hessian of the function $\log(p(\mathbf{X} | \theta_m, M_m)\, p(\theta_m | M_m))$ evaluated at the posterior mode of $\theta_m$, that is $\hat{\theta}_m$. The matrix $\hat{H}$ is asymptotically equal to the posterior covariance matrix [47], and is computed as the sample covariance matrix of the posterior simulated sample. We note that, in the proposed DPPM models, the number of components $K$ is itself a parameter of the model and changes during the sampling, which leads to parameters of different dimensions. We therefore compute the matrix $\hat{H}$ in Eq. (34) from the posterior samples corresponding to the posterior mode of $K$. Once the Bayes factors are estimated, they can be interpreted as described in Table 2, as suggested by [48]; see also [23].
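The approximation in Equation (34) is straightforward to evaluate on the log scale from a set of posterior draws. The sketch below is illustrative only: it assumes the posterior mode is approximated by the draw with the highest joint density and that all draws have the same dimension (for the DPPM, the draws associated with the posterior mode of K). The function name and the `log_joint` callback are hypothetical.

```python
import numpy as np

def laplace_metropolis_log_ml(samples, log_joint):
    """Log of the Laplace-Metropolis estimate of p(X | M).

    samples: array of shape (S, nu) holding posterior draws of the nu free
        parameters.
    log_joint: function theta -> log p(X | theta, M) + log p(theta | M).
    """
    samples = np.asarray(samples, dtype=float)
    nu = samples.shape[1]
    # Posterior mode approximated by the draw with the highest joint density
    mode_value = max(log_joint(theta) for theta in samples)
    # H approximated by the sample covariance of the posterior draws
    cov = np.atleast_2d(np.cov(samples, rowvar=False))
    sign, log_det = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("posterior covariance estimate is not positive definite")
    return 0.5 * nu * np.log(2.0 * np.pi) + 0.5 * log_det + mode_value
```

Working with the log-determinant and the log joint density avoids the underflow that a direct evaluation of Equation (34) would cause for data sets of realistic size.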


$BF_{12}$     $2 \log BF_{12}$     Evidence for model $M_1$
< 1           < 0                  Negative ($M_2$ is selected)
1-3           0-2                  Not bad
3-12          2-5                  Substantial
12-150        5-10                 Strong
> 150         > 10                 Decisive


Table 2: Model comparison and selection using Bayes factors.
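The scale of Table 2 can be turned into a small helper. The sketch below is an illustration only; the function name is hypothetical, and the handling of values that fall exactly on a boundary (for example, 5 is labelled "Strong") is an assumption, since the table gives ranges without stating which side is inclusive.

```python
def evidence_category(two_log_bf):
    """Map a value of 2 log BF_12 to the evidence label of Table 2."""
    if two_log_bf < 0:
        return "Negative (M2 is selected)"
    if two_log_bf < 2:
        return "Not bad"
    if two_log_bf < 5:
        return "Substantial"
    if two_log_bf <= 10:
        return "Strong"
    return "Decisive"
```

Given two log marginal likelihoods, the input is simply twice their difference, for example `evidence_category(2 * (log_ml_1 - log_ml_2))`.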


4. Experiments

We performed experiments on both simulated and real data in order to evaluate the proposed DPPM models. We assess their flexibility in terms of modeling, and their use for clustering and for inferring the number of clusters from the data. We show how the proposed DPPM approach is able to automatically and simultaneously select the best model with the optimal number of clusters by using the Bayes factors, which are also used to evaluate the results. We also perform comparisons with the finite model-based clustering approach (as in [10, 15]), which will be abbreviated as the PGMM approach. We also use the Rand index to evaluate and compare the provided partitions, and the misclassification error rate when the number of estimated components equals the actual one.
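The two partition-quality measures mentioned above can be computed as in the following sketch. It is a plain illustration with hypothetical function names: the Rand index is the fraction of observation pairs on which two partitions agree, and the misclassification rate is taken here as the smallest error over all relabelings of the estimated clusters, which is only defined when both partitions have the same number of clusters. The rate is returned as a fraction, not as a percentage.

```python
from itertools import combinations, permutations

def rand_index(labels_a, labels_b):
    """Fraction of observation pairs on which two partitions agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

def misclassification_rate(true_labels, est_labels):
    """Smallest error rate over relabelings of the estimated clusters.

    The brute-force search over permutations is adequate for the small
    numbers of clusters considered here.
    """
    true_ids = sorted(set(true_labels))
    est_ids = sorted(set(est_labels))
    if len(true_ids) != len(est_ids):
        raise ValueError("partitions must have the same number of clusters")
    best = len(true_labels)
    for perm in permutations(true_ids):
        mapping = dict(zip(est_ids, perm))
        errors = sum(mapping[e] != t for t, e in zip(true_labels, est_labels))
        best = min(best, errors)
    return best / len(true_labels)
```

Both functions are invariant to the arbitrary numbering of the clusters, which matters because the labels produced by a Gibbs sampler carry no intrinsic meaning.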

For the simulations, we consider several situations of simulated data, from different models and with different levels of cluster separation, in order to assess the efficiency of the proposed approach in retrieving the actual partition with the actual number of clusters. We also assess the stability of the proposed DPPM models with respect to the choice of the hyperparameter values, by considering several situations and varying them. Then, we perform experiments on several real data sets and provide numerical results in terms of comparisons of the Bayes factors (via the log marginal likelihood values), as well as the Rand index and the misclassification error rate for data sets with a known actual partition. In the experiments, for each of the compared approaches and for each model, the Gibbs sampler is run ten times with different initializations. Each Gibbs run generates 2000 samples, of which 100 burn-in samples are removed. The solution corresponding to the highest Bayes factor among those ten runs is then selected.
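The post-processing of a single Gibbs run (discarding the burn-in draws, then retaining the most frequent number of clusters) can be sketched as follows. The function name and the representation of the run as a list of sampled values of K are assumptions made for the illustration.

```python
from collections import Counter

def posterior_mode_of_k(k_trace, burn_in=100):
    """Most frequent number of clusters after discarding burn-in draws.

    Returns the mode and its empirical posterior probability.
    """
    kept = list(k_trace)[burn_in:]
    if not kept:
        raise ValueError("no samples left after burn-in")
    counts = Counter(kept)
    mode, freq = counts.most_common(1)[0]
    return mode, freq / len(kept)
```

The second returned value is the empirical posterior probability of the retained number of clusters, which is the quantity displayed in the posterior-distribution panels of the figures.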

4.1. Experiments on simulated data

4.1.1. Varying the clusters shapes, orientations, volumes and separation

In this experiment, we apply the proposed models to data simulated according to different models and with different levels of mixture separation, going from poorly separated to very well separated mixtures. To simulate the data, we first consider an experimental protocol close to the one used by [5], where the authors considered parsimonious mixture estimation within an MLE framework. This allows us to see how the proposed Bayesian nonparametric DPPM performs compared to the standard parametric non-Bayesian one. We note however that in [5] the number of components was known a priori and the problem of estimating the number of classes was not considered. We have performed extensive experiments involving all the models and many Monte Carlo simulations for several data structure situations. Given the variety of models, data structures, levels of separation, etc., it is not possible to display all the results in the paper. We choose to proceed in the same way as in the standard paper [5] by displaying, for the experiments on simulated data, the results for six models of different structures. The data are generated from a two-component Gaussian mixture in $\mathbb{R}^2$ with 200 observations. The six different structures of the mixture that have been considered to generate the data are: two spherical models, $\lambda I$ and $\lambda_k I$; two diagonal models, $\lambda A$ and $\lambda_k A$; and two general models, $\lambda D A D^T$ and $\lambda_k D A D^T$. Table 3 shows the considered model structures and the respective model parameter values used to generate the data sets. Let us recall that the variation in the volume is related to $\lambda$, the variation of the


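Model                   Parameter values
$\lambda I$             $\lambda = 1$
$\lambda_k I$           $\lambda_k = \{1, 5\}$
$\lambda A$             $\lambda = 1$; $A = \mathrm{diag}(3, 1/3)$
$\lambda_k A$           $\lambda_k = \{1, 5\}$; $A = \mathrm{diag}(3, 1/3)$
$\lambda D A D^T$       $\lambda = 1$; $D = \begin{pmatrix} \sqrt{2}/2 & -\sqrt{2}/2 \\ \sqrt{2}/2 & \sqrt{2}/2 \end{pmatrix}$
$\lambda_k D A D^T$     $\lambda_k = \{1, 5\}$; $D = \begin{pmatrix} \sqrt{2}/2 & -\sqrt{2}/2 \\ \sqrt{2}/2 & \sqrt{2}/2 \end{pmatrix}$


Table 3: Considered two-component Gaussian mixture with different structures.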

shape is related to $A$ and the variation of the orientation is related to $D$. Furthermore, for each type of model structure, we consider three different levels of mixture separation, that is: poorly separated, well separated, and very well separated mixtures. This is achieved by varying the following distance between the two mixture components: $\varrho^2 = (\mu_1 - \mu_2)^T \left(\frac{\Sigma_1 + \Sigma_2}{2}\right)^{-1} (\mu_1 - \mu_2)$.

We consider the values $\varrho = \{1, 3, 4.5\}$. As a result, we obtain 18 different data structures with poorly ($\varrho = 1$), well ($\varrho = 3$) and very well ($\varrho = 4.5$) separated mixture components. As it is difficult to show the figures for all the situations and the corresponding results, Figure 1 shows, for three models with equal volume across the mixture components, different data sets with varying levels of mixture separation. Likewise, Figure 2 shows, for the models with varying volume across the mixture components, different data sets with varying levels of mixture separation. We compare the proposed DPPM to the parametric PGMM approach in model-based clustering [15], for which the number of mixture components was varied in the range $K = 1, \ldots, 5$ and the optimal number of mixture components was selected by using the Bayes factor (via the log marginal likelihoods). For these data sets, the hyperparameters were set as follows: $\mu_0$ was equal to the mean of
Figure 1: Examples of simulated data with the same volume across the mixture components: spherical model $\lambda I$ with poor separation (left), diagonal model $\lambda A$ with good separation (middle), and general model $\lambda D A D^T$ with very good separation (right).



Figure 2: Examples of simulated data with the volume changing across the mixture components: spherical model $\lambda_k I$ with poor separation (left), diagonal model $\lambda_k A$ with good separation (middle), and general model $\lambda_k D A D^T$ with very good separation (right).


the data, the shrinkage $\kappa_0 = 5$, the degrees of freedom $\nu_0 = d + 2$, the scale matrix $\Lambda_0$ was equal to the covariance of the data, and the hyperparameter for the spherical models $s_0^2$ was set to the greatest eigenvalue of $\Lambda_0$.
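The simulation protocol of this subsection can be illustrated with a short Python sketch. Several elements are assumptions made for the example and are not taken from the paper: the orientation matrix is taken to be a rotation by 45 degrees, the separation is computed as a Mahalanobis-type distance with the averaged covariance, the mixing proportions are equal, and all function names are hypothetical.

```python
import numpy as np

def covariance(lam, A, D):
    """Eigenvalue-decomposed covariance: volume * D * shape * D^T."""
    return lam * D @ A @ D.T

def separation(mu1, mu2, sigma1, sigma2):
    """Distance between two Gaussian components using the averaged covariance."""
    diff = np.asarray(mu1, dtype=float) - np.asarray(mu2, dtype=float)
    pooled = 0.5 * (sigma1 + sigma2)
    return float(np.sqrt(diff @ np.linalg.solve(pooled, diff)))

def simulate(rng, n, mus, sigmas, weights=(0.5, 0.5)):
    """Draw n points from a Gaussian mixture; returns the data and the labels."""
    labels = rng.choice(len(weights), size=n, p=weights)
    data = np.stack([rng.multivariate_normal(mus[z], sigmas[z]) for z in labels])
    return data, labels

# General model with different volumes: lambda_k = {1, 5}, A = diag(3, 1/3)
theta = np.pi / 4                      # assumed rotation angle
D = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
A = np.diag([3.0, 1.0 / 3.0])
sigma_1 = covariance(1.0, A, D)
sigma_2 = covariance(5.0, A, D)
```

Because the shape matrix has unit determinant, the determinant of each covariance equals the squared volume parameter in two dimensions, which gives a quick sanity check on the construction.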

4.1.2. Obtained results

Tables 4, 5 and 6 provide the approximated log marginal likelihoods obtained by the PGMM and the proposed DPPM models for, respectively, the equal (with equal cluster volumes) spherical data structure model ($\lambda I$) with poorly separated mixture ($\varrho = 1$), the equal diagonal data structure model ($\lambda A$) with good mixture separation ($\varrho = 3$), and the equal general data structure model ($\lambda D A D^T$) with very good mixture separation ($\varrho = 4.5$). Tables 7, 8 and 9 provide the approximated log marginal likelihoods obtained by the PGMM and the proposed DPPM models for, respectively, the different (with different cluster volumes) spherical data structure model ($\lambda_k I$) with poorly separated mixture ($\varrho = 1$), the different diagonal data structure model ($\lambda_k A$) with good mixture separation ($\varrho = 3$), and the different general data structure model ($\lambda_k D A D^T$) with very good mixture separation ($\varrho = 4.5$).
                         DPPM            PGMM (log ML)
Model                    K     log ML      K = 1      K = 2      K = 3      K = 4      K = 5
$\lambda I$              2    -604.54    -633.88    -631.59    -635.07    -587.41    -595.63
$\lambda_k I$            2    -589.59    -592.80    -589.88    -592.87    -593.26    -602.98
$\lambda A$              2    -589.74    -591.67    -590.10    -593.04    -598.67    -599.75
$\lambda_k A$            2    -591.65    -594.37    -592.46    -595.88    -607.01    -611.36
$\lambda D A D^T$        2    -590.65    -592.20    -589.65    -596.29    -598.63    -607.74
$\lambda_k D A D^T$      2    -591.77    -594.33    -594.89    -597.96    -594.49    -601.84


Table 4: Log marginal likelihood values obtained by the proposed DPPM and PGMM for the generated data with $\lambda I$ model structure and poorly separated mixture ($\varrho = 1$).



                         DPPM            PGMM (log ML)
Model                    K     log ML      K = 1      K = 2      K = 3      K = 4      K = 5
$\lambda I$              2    -730.31    -771.39    -702.38    -703.90    -708.71    -840.49
$\lambda_k I$            2    -702.89    -730.26    -702.30    -704.68    -708.43    -713.58
$\lambda A$              2    -679.76    -704.40    -680.03    -683.13    -686.19    -691.93
$\lambda_k A$            2    -685.33    -707.26    -688.69    -696.46    -703.68    -712.93
$\lambda D A D^T$        2    -681.84    -693.44    -682.63    -688.39    -694.25    -717.26
$\lambda_k D A D^T$      2    -693.70    -695.81    -684.63    -688.17    -694.02    -695.75


Table 5: Log marginal likelihood values obtained by the proposed DPPM and the PGMM for the generated data with $\lambda A$ model structure and well separated mixture ($\varrho = 3$).



                         DPPM            PGMM (log ML)
Model                    K     log ML      K = 1      K = 2      K = 3      K = 4      K = 5
$\lambda I$              2    -762.16    -850.66    -747.29    -746.09    -744.63    -824.06
$\lambda_k I$            2    -748.97    -809.46    -748.17    -751.08    -756.59    -766.26
$\lambda A$              2    -746.05    -778.42    -746.32    -749.59    -753.64    -758.92
$\lambda_k A$            2    -751.17    -781.31    -752.66    -761.02    -772.44    -780.34
$\lambda D A D^T$        2    -701.94    -746.11    -698.54    -702.79    -707.83    -716.43
$\lambda_k D A D^T$      2    -702.79    -748.36    -703.35    -708.77    -715.10    -722.25


Table 6: Log marginal likelihood values obtained by the proposed DPPM and PGMM for the generated data with $\lambda D A D^T$ model structure and very well separated mixture ($\varrho = 4.5$).


From these results, we can see that the proposed DPPM, in all the situations (except for the first situation in Table 4), retrieves the actual model with the actual number of clusters. We can also see that, except for two situations, the selected DPPM model has the highest log marginal likelihood value compared to the PGMM. We also observe that the solutions provided by the proposed DPPM are, in some cases, more parsimonious than those
                         DPPM            PGMM (log ML)
Model                    K     log ML      K = 1      K = 2      K = 3      K = 4      K = 5
$\lambda I$              3    -843.50    -869.52    -825.68    -890.26    -906.44   -1316.40
$\lambda_k I$            2    -805.24    -828.39    -805.21    -808.43    -811.43    -822.99
$\lambda A$              2    -820.33    -823.55    -821.22    -825.58    -828.86    -838.82
$\lambda_k A$            2    -808.32    -826.34    -808.46    -816.65    -824.20    -836.85
$\lambda D A D^T$        2    -824.00    -823.72    -821.92    -830.44    -841.22    -852.78
$\lambda_k D A D^T$      2    -821.29    -826.05    -803.96    -813.61    -819.66    -821.75


Table 7: Log marginal likelihood values and estimated number of clusters for the generated data with $\lambda_k I$ model structure and poorly separated mixture ($\varrho = 1$).



                         DPPM            PGMM (log ML)
Model                    K     log ML      K = 1      K = 2      K = 3      K = 4      K = 5
$\lambda I$              3    -927.01    -986.12    -938.65    -956.05   -1141.00   -1064.90
$\lambda_k I$            3    -912.27    -944.87    -925.75    -911.31    -914.33    -918.99
$\lambda A$              3    -899.00    -918.47    -906.59    -911.13    -917.18    -926.69
$\lambda_k A$            2    -883.05    -921.44    -883.22    -897.99    -909.26    -928.90
$\lambda D A D^T$        2    -903.43    -918.19    -902.23    -906.40    -914.35    -924.12
$\lambda_k D A D^T$      2    -894.05    -920.65    -876.62    -886.86    -904.45    -919.45


Table 8: Log marginal likelihood values obtained by the proposed DPPM and PGMM for the generated data with $\lambda_k A$ model structure and well separated mixture ($\varrho = 3$).



                         DPPM            PGMM (log ML)
Model                    K     log ML      K = 1      K = 2      K = 3      K = 4      K = 5
$\lambda I$              2    -984.33   -1077.20   -1021.60   -1012.30   -1021.00    -987.06
$\lambda_k I$            3    -963.45   -1035.80    -972.45    -961.91    -967.64    -970.93
$\lambda A$              2    -980.07   -1012.80    -980.92    -986.39    -992.05    -999.14
$\lambda_k A$            2    -988.75   -1015.90    -991.21   -1007.00   -1023.70   -1041.40
$\lambda D A D^T$        3    -931.42    -984.93    -939.63    -944.89    -952.35    -963.04
$\lambda_k D A D^T$      2    -921.90    -987.39    -921.99    -930.61    -946.18    -956.35


Table 9: Log marginal likelihood values obtained by the proposed DPPM and PGMM for the generated data with $\lambda_k D A D^T$ model structure and very well separated mixture ($\varrho = 4.5$).


provided by the PGMM and, in the other cases, the same as those provided by the PGMM. For example, in Table 4, which corresponds to data from a poorly separated mixture, we can see that the proposed DPPM selects the spherical model, which is more parsimonious than the general model $\lambda A$ selected by the PGMM, with a better misclassification error (see Table 10). The same thing can be observed in Table 8, where the proposed DPPM selects the actual diagonal model $\lambda_k A$, whereas the PGMM selects the general model $\lambda_k D A D^T$, even though the clusters are well separated ($\varrho = 3$).

In terms of misclassification error also, as shown in Table 10, the proposed DPPM models, compared to the PGMM ones, provide partitions with the lower misclassification error, for situations with poorly, well or very well separated clusters, and for clusters with equal and different volumes (except for one situation). On the other hand, for the DPPM models, from the log


Situation    Table 4      Table 5      Table 6      Table 7       Table 8       Table 9
PGMM         48 ± 8.05    9.5 ± 3.68   1 ± 0.80     23.5 ± 2.89   10.5 ± 2.44   2 ± 1.69
DPPM         40 ± 4.66    7 ± 3.02     3 ± 0.97     20.5 ± 3.34   7 ± 3.73      1.5 ± 0.79


Table 10: Misclassification error rates obtained by the proposed DPPM and the PGMM approach. From left to right, the situations respectively shown in Tables 4, 5, 6, 7, 8 and 9.


marginal likelihoods shown in Tables 4 to 9, we can see that the evidence of the selected model, compared to the majority of the other alternatives, is, according to Table 2, in general decisive. Indeed, it can easily be seen that the value $2 \log BF_{12}$ of the Bayes factor between the selected model and the other models is more than 10, which corresponds to decisive evidence for the selected model. Also, if we consider the evidence of the selected model against the most competitive one, one can see from Table 11 and Table 12 that, for the situation with very bad mixture separation and clusters having the same volume, the evidence is not bad (0.3). However, for all the other situations, the optimal model is selected with an evidence going from almost substantial (a value of 1.7) to strong and decisive, especially for the models with different cluster volumes. We can also conclude that the models with different cluster volumes may work better in practice, as highlighted by [5]. Finally, Figure 3 shows the


$M_1$ vs $M_2$    $\lambda_k I$ vs $\lambda A$    $\lambda A$ vs $\lambda D A D^T$    $\lambda D A D^T$ vs $\lambda_k D A D^T$
$2 \log BF$       0.30                            4.16                                1.70


Table 11: Bayes factor values obtained by the proposed DPPM by comparing the selected model (denoted $M_1$) and its most competitive alternative (denoted $M_2$). From left to right, the situations respectively shown in Table 4, Table 5 and Table 6.

best estimated partitions for the data structures with equal volume across the mixture components shown in Fig. 1 and the posterior distribution over the number of clusters. One can see that for the case of clusters with equal volume, the diagonal family ($\lambda A$) with well separated mixture ($\varrho = 3$) and
$M_1$ vs $M_2$    $\lambda_k I$ vs $\lambda_k A$    $\lambda_k A$ vs $\lambda_k D A D^T$    $\lambda_k D A D^T$ vs $\lambda D A D^T$
$2 \log BF$       6.16                              22                                      19.04


Table 12: Bayes factor values obtained by the proposed DPPM by comparing the selected model (denoted $M_1$) and its most competitive alternative (denoted $M_2$). From left to right, the situations respectively shown in Table 7, Table 8 and Table 9.





Figure 3: Partitions obtained by the proposed DPPM for the data sets in Fig. 1. 


the general family ($\lambda D A D^T$) with very well separated mixture ($\varrho = 4.5$) data structures yield a good estimate of the number of clusters together with the actual model. However, for the equal spherical data structure ($\lambda I$), the $\lambda_k I$ model is estimated, which is also a spherical model. Figure 4 shows the best estimated partitions for the data structures with different volume across the mixture components shown in Fig. 2 and the posterior distribution over the number of clusters. One can see that for all the different data structure models (different spherical $\lambda_k I$, different diagonal $\lambda_k A$ and different general $\lambda_k D A D^T$), the proposed DPPM approach succeeded in estimating the correct number of clusters, equal to 2, with the actual cluster structure.

4.1.3. Stability with respect to the variation of the hyperparameters values

In order to illustrate the effect of the choice of the hyperparameter values of the mixture on the estimations, we considered two-class situations identical to those used in the parametric parsimonious mixture approach proposed in [15]. The data set consists of a sample of n = 200 observations from a two-component Gaussian mixture in $\mathbb{R}^2$ with the following parameters:
Figure 4: Partitions obtained by the proposed DPPM for the data sets in Fig. 2.


$\pi_1 = \pi_2 = 0.5$, $\mu_1 = (8, 8)^T$ and $\mu_2 = (2, 2)^T$, and two spherical covariances with different volumes, $\Sigma_1 = 4\, I_2$ and $\Sigma_2$. In Figure 5 we can see a simulated data set from this experiment with the corresponding actual partition and density ellipses. In order to assess the stability of the models



Figure 5: A two-class data set simulated according to $\lambda_k I$, and the actual partition.

with respect to the values of the hyperparameters, we consider four situations with different hyperparameter values. These situations are as follows. The hyperparameters $\nu_0$ and $\mu_0$ are assumed to be the same for the four situations: $\nu_0 = d + 2 = 4$ (related to the number of degrees of freedom) and $\mu_0$ equals the empirical mean vector of the data. We vary the two hyperparameters $\kappa_0$, which controls the prior over the mean, and $s_0^2$, which controls the covariance. The four considered situations are shown in Table 13. We consider and compare four models covering the spherical, diagonal and general families, which are $\lambda I$, $\lambda_k I$, $\lambda_k A$ and $\lambda_k D A D^T$. Table 14 shows the obtained log marginal likelihood values for
Sit.          1                   2                   3                     4
$s_0^2$       max(eig(cov(X)))    max(eig(cov(X)))    4 max(eig(cov(X)))    max(eig(cov(X)))/4
$\kappa_0$    1                   5                   5                     5


Table 13: Four different situations of the hyperparameter values.


the four models for each of the situations of the hyperparameters. One can see that, for all the situations, the selected model is $\lambda_k I$, that is, the one that corresponds to the actual model, and it has the correct number of clusters (two clusters). Also, it can be seen from Table 15 that the Bayes factor values


Model    $\lambda I$        $\lambda_k I$      $\lambda A$        $\lambda_k D A D^T$
Sit.     K   log ML         K   log ML         K   log ML         K   log ML
1        2   -919.3150      2   -865.9205      3   -898.7853      3   -885.9710
2        3   -898.6422      2   -860.1917      2   -890.6766      2   -885.5094
3        2   -927.8240      2   -884.6627      2   -906.7430      2   -901.0774
4        2   -919.4910      2   -861.0925      2   -894.9835      2   -889.9267


Table 14: Log marginal likelihood values for the proposed DPPM for 4 situations of hyperparameter values.


($2 \log BF$) between the selected model and the most competitive one, for each of the four situations, correspond, according to Table 2, to decisive evidence for the selected model. These results confirm the stability of the


Sit.           1        2        3        4
$2 \log BF$    40.10    50.63    32.82    57.66


Table 15: Bayes factor values for the proposed DPPM computed from Table 14 by comparing the selected model ($M_1$, here in all cases $\lambda_k I$) and its most competitive alternative ($M_2$, here in all cases $\lambda_k D A D^T$).


DPPM with respect to the variation of the hyperparameter values. Figure 6 shows the best estimated partitions obtained by the proposed DPPM for the generated data. Note that, for all four situations, the estimated number of clusters equals 2, and the posterior probability of this number of clusters is very close to 1.
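As a worked check, the values of Table 15 can be recomputed from the log marginal likelihoods of Table 14, since $2 \log BF$ is twice the difference between the log marginal likelihoods of the best model and of its closest competitor. The sketch below is illustrative; the dictionary layout and the function name are hypothetical, and the model labels follow the header of Table 14.

```python
# Log marginal likelihoods from Table 14, one dictionary per situation.
log_ml = {
    1: {"lambda I": -919.3150, "lambda_k I": -865.9205,
        "lambda A": -898.7853, "lambda_k D A D^T": -885.9710},
    2: {"lambda I": -898.6422, "lambda_k I": -860.1917,
        "lambda A": -890.6766, "lambda_k D A D^T": -885.5094},
    3: {"lambda I": -927.8240, "lambda_k I": -884.6627,
        "lambda A": -906.7430, "lambda_k D A D^T": -901.0774},
    4: {"lambda I": -919.4910, "lambda_k I": -861.0925,
        "lambda A": -894.9835, "lambda_k D A D^T": -889.9267},
}

def best_vs_runner_up(scores):
    """Return the best model, its closest competitor, and 2 log BF between them."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, l1), (second, l2) = ranked[0], ranked[1]
    return best, second, 2.0 * (l1 - l2)

results = {sit: best_vs_runner_up(scores) for sit, scores in log_ml.items()}
```

In every situation the best model is the spherical model with different volumes and the runner-up is the general model with different volumes, and the recomputed values agree with Table 15 to within 0.01.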

Figure 6: Best estimated partitions obtained by the proposed $\lambda_k I$ DPPM for the four situations of hyperparameter values.


4.2. Experiments on real data

To confirm the results previously obtained on simulated data, we have conducted several experiments on freely available real data sets: Iris, Old Faithful Geyser, Crabs and Diabetes, whose characteristics are summarized in Table 16. We compare the proposed DPPM models to the PGMM models. We note that we have also conducted experiments on other data sets, namely Trees and Wine, but for lack of space we choose to present the results for these four data sets.


Dataset                # data (n)    # dimensions (d)    True # clusters (K)
Old Faithful Geyser    272           2                   Unknown
Crabs                  200           5                   2
Diabetes               145           3                   3
Iris                   150           4                   3


Table 16: Description of the used real data sets.


4.2.1. Clustering of the Old Faithful Geyser data set

The Old Faithful Geyser data set [49] comprises n = 272 measurements of the eruption of the Old Faithful geyser at Yellowstone National Park in the USA. Each measurement is bi-dimensional (d = 2) and comprises the duration of the eruption and the time to the next eruption, both in minutes. While the number of clusters for this data set is unknown, several clustering studies in the literature estimate it at two, often interpreted as short and long eruptions.

We applied the proposed DPPM approach and the PGMM alternative to this data set (after standardization). For the PGMM, the value of K was varied from 1 to 6. Table 17 reports the log marginal likelihood values obtained by the PGMM and the proposed DPPM for the Old Faithful Geyser data set. One can see that the parsimonious DPPM models estimate two clusters, except for one model, the diagonal model with equal volume $\lambda A$, which estimates three clusters. For a number of clusters varying from 1 to 6, the parsimonious PGMM models estimate two clusters with three exceptions,
                           DPPM            PGMM (log ML)
Model                      K     log ML      K = 1      K = 2      K = 3      K = 4      K = 5      K = 6
$\lambda I$                2    -458.19    -834.75    -455.15    -457.56    -461.42    -429.66   -1665.00
$\lambda_k I$              2    -451.11    -779.79    -449.32    -454.22    -460.30    -468.66    -475.63
$\lambda A$                3    -424.23    -781.86    -445.23    -445.61    -445.63    -448.93    -453.44
$\lambda_k A$              2    -446.22    -784.75    -461.23    -465.94    -473.55    -481.20    -489.71
$\lambda D A D^T$          2    -418.99    -554.33    -428.36    -429.78    -433.36    -436.52    -440.86
$\lambda_k D A D^T$        2    -434.50    -556.83    -420.88    -421.96    -422.65    -430.09    -434.36
$\lambda D_k A D_k^T$      2    -428.96    -780.80    -443.51    -442.66    -446.21    -449.40    -456.14
$\lambda_k D_k A D_k^T$    2    -421.49    -553.87    -434.37    -433.77    -439.60    -442.56    -447.88


Table 17: Log marginal likelihood values for the Old Faithful Geyser data set.


including the spherical model $\lambda I$, which overestimates the number of clusters (provides 5 clusters). However, the solution provided by the proposed DPPM for the spherical model $\lambda I$ is more stable and estimates two clusters. It can also be seen that the best model, with the highest value of the log marginal likelihood, is the one provided by the proposed DPPM and corresponds to the general model $\lambda D A D^T$ with equal volume and the same shape and orientation. On the other hand, it can also be noticed that, in terms of Bayes factors, the model $\lambda D A D^T$ selected by the proposed DPPM has decisive evidence compared to the other models, and strong evidence (the value of $2 \log BF$ equals 5) compared to the most competitive one, which is in this case the model $\lambda_k D_k A D_k^T$.

Figure 7 shows the data, this optimal partition and the posterior distribution for the number of clusters. One can observe that the likely partition is provided with a number of clusters having a high posterior probability (more than 0.9).



Figure 7: Old Faithful Geyser data set (left), the optimal partition obtained by the DPPM model $\lambda D A D^T$ (middle) and the empirical posterior distribution for the number of mixture components (right).
4.2.2. Clustering of the Crabs data set

The Crabs data set comprises n = 200 observations describing d = 6 morphological measurements (Species, Frontal lip, Rear width, Length, Width, Depth) on 50 crabs each of two colour forms and both sexes, of the species Leptograpsus variegatus collected at Fremantle, W. Australia [50]. The crabs are classified according to their sex (K = 2). We applied the proposed DPPM approach and the PGMM alternative to this data set (after PCA and standardization). For the PGMM, the value of K was varied from 1 to 6. Table 18 reports the log marginal likelihood values obtained by the PGMM and the proposed DPPM approaches for the Crabs data set. One can first



                           DPPM            PGMM (log ML)
Model                      K     log ML      K = 1      K = 2      K = 3      K = 4      K = 5      K = 6
$\lambda I$                3    -550.75    -611.30    -615.73    -556.05    -860.95    -659.93    -778.21
$\lambda_k I$              3    -555.91    -570.13    -549.06    -538.04    -542.31    -577.22    -532.40
$\lambda A$                4    -537.81    -572.06    -539.17    -532.65    -535.20    -534.43    -531.19
$\lambda_k A$              3    -543.97    -574.82    -541.27    -569.79    -590.48    -693.42    -678.95
$\lambda D A D^T$          4    -526.87    -554.64    -540.87    -512.78    -525.19    -541.93    -576.27
$\lambda_k D A D^T$        3    -517.58    -556.73    -541.88    -515.93    -530.02    -550.71    -595.38
$\lambda D_k A D_k^T$      4    -549.78    -573.80    -564.28    -541.67    -547.45    -547.13    -526.79
$\lambda_k D_k A D_k^T$    2    -499.54    -557.69    -500.24    -700.44    -929.24   -1180.10   -1436.60


Table 18: Log marginal likelihood values for the Crabs data set.


see that the best solution, corresponding to the model with the highest value of the log marginal likelihood, is the one provided by the proposed DPPM and corresponds to the general model $\lambda_k D_k A D_k^T$ with different volume and orientation but equal shape. This model provides a partition with a number of clusters equal to the actual one, K = 2. One can also see that the best solution for the PGMM approach is the one provided by the same model with a correctly estimated number of clusters. On the other hand, one can also see that for this Crabs data set, the proposed DPPM models estimate the number of clusters between 2 and 4. This may be related to the fact that, for the Crabs data set, the data, in addition to their sex, are also described in terms of their species, and the data contain two species. This may therefore result in four subgroupings of the data in four clusters, each couple of them corresponding to two species, and the solution of four clusters may be plausible for this data set. However, three PGMM models overestimate the number of clusters and provide solutions with 6 clusters. We can also observe that, in terms of Bayes factors, the model $\lambda_k D_k A D_k^T$ selected by the proposed DPPM for this data set has decisive evidence compared to all the other potential models. For example, the value of $2 \log BF$ for this selected model against the most competitive one, which is in this case the model $\lambda_k D A D^T$, equals 36.08 and corresponds to decisive evidence for the selected model.

The good performance of the DPPM compared to the PGMM is also confirmed in terms of Rand index and misclassification error rate values. The optimal partition obtained by the proposed DPPM with the parsimonious model $\lambda_k D_k A D_k^T$ is the best defined one and corresponds to the highest Rand index value of 0.8111 and the lowest error rate of 10.5 ± 1.98. In comparison, the partition obtained by the PGMM has a Rand index of 0.8032 with an error rate of 11 ± 2.07. Figure 8 shows the Crabs data, the optimal partition and the posterior distribution for the number of clusters. One can observe that the provided partition is quite precise and is provided with a number of clusters equal to the actual one, and with a posterior probability very close to 1.



Figure 8: Crabs data set in the two first principal axes and the actual partition (left), the optimal partition obtained by the DPPM model $\lambda_k D_k A D_k^T$ (middle) and the empirical posterior distribution for the number of mixture components (right).

4-2.3. Clustering of the Diahetes data set 

The Diabetes data set, described and analysed in [51], consists of n = 145
subjects described by d = 3 features: the area under a plasma glucose curve
(glucose area), the area under a plasma insulin curve (insulin area) and the
steady-state plasma glucose response (SSPG). This data set has K = 3 groups:
chemical diabetes, overt diabetes and normal (nondiabetic). We applied the
proposed DPPM models and the alternative PGMM ones to this data set (the data
were standardized). For the PGMM, the number of clusters was varied from 1
to 8.

Table 19 reports the log marginal likelihood values obtained by the two
approaches for the Diabetes data set. One can see that both the proposed
DPPM and the PGMM correctly estimate the true number of clusters. However,
the best model, with the highest log marginal likelihood value, is the one
obtained by the proposed DPPM approach and corresponds to the parsimonious
model $\lambda_k D_k A D_k^T$ with the actual number of clusters (K = 3).

| Model | DPPM: K | DPPM: log ML | PGMM: K = 1 | K = 2 | K = 3 | K = 4 | K = 5 | K = 6 | K = 7 | K = 8 |
|---|---|---|---|---|---|---|---|---|---|---|
| $\lambda I$ | 4 | -573.73 | -735.80 | -675.00 | -487.65 | -601.38 | -453.77 | -468.55 | -421.33 | -533.97 |
| $\lambda_k I$ | 7 | -357.18 | -632.18 | -432.02 | -412.91 | -417.91 | -398.02 | -363.12 | -348.67 | -378.48 |
| $\lambda A$ | 8 | -536.82 | -635.70 | -492.61 | -488.55 | -418.51 | -391.05 | -377.37 | -370.47 | -365.56 |
| $\lambda_k A$ | 6 | -362.03 | -638.69 | -416.27 | -372.71 | -358.45 | -381.68 | -366.15 | -385.73 | -495.63 |
| $\lambda D A D^T$ | 7 | -392.67 | -430.63 | -418.96 | -412.70 | -375.37 | -390.06 | -405.11 | -426.92 | -427.46 |
| $\lambda_k D A D^T$ | 5 | -350.29 | -432.85 | -326.49 | -343.69 | -325.46 | -355.90 | -346.91 | -330.11 | -331.36 |
| $\lambda D_k A D_k^T$ | 5 | -338.41 | -644.06 | -427.66 | -454.47 | -383.53 | -376.03 | -356.09 | -355.03 | -349.84 |
| $\lambda_k D_k A D_k^T$ | 3 | -238.62 | -433.61 | -263.49 | -248.85 | -273.31 | -317.81 | -440.67 | -453.70 | -526.52 |

Table 19: Log marginal likelihood values obtained for the Diabetes data set.

Also, the evidence of the model $\lambda_k D_k A D_k^T$ selected by the
proposed DPPM for the Diabetes data set, compared to all the other models, is
decisive. Indeed, in terms of Bayes factor comparison, the value of 2 log BF
for this selected model against its closest competitor, which is in this case
the model $\lambda D_k A D_k^T$, is 199.58 and corresponds to decisive evidence
for the selected model. In terms of Rand index, the best defined partition is
the one obtained by the proposed DPPM approach with the parsimonious model
$\lambda_k D_k A D_k^T$, which has the highest Rand index value of 0.8081,
indicating that the partition is well defined, with a misclassification error
rate of 17.24 ± 2.47. By contrast, the best PGMM partition, obtained with the
model $\lambda_k D_k A D_k^T$, has a Rand index of 0.7615 with an error rate of
22.06 ± 2.51. Figure 9 shows the data, the optimal partition provided by the
DPPM model $\lambda_k D_k A D_k^T$ and the distribution of the number of
clusters K. We can observe that the partition is quite well defined and that
the posterior mode of the number of clusters equals the actual number of
clusters (K = 3).
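
The empirical posterior distribution of the number of clusters, and its mode,
can be read off the Gibbs sampler output by counting the occupied clusters at
each retained sweep. The sketch below is only an illustration under stated
assumptions: it supposes that the sampler stores one label vector per sweep,
and the function name `posterior_of_k`, the `burn_in` argument and the toy
chain are hypothetical rather than taken from the experiments reported here.

```python
from collections import Counter

import numpy as np


def posterior_of_k(label_samples, burn_in=0):
    """Empirical posterior of the number of occupied clusters.

    label_samples: sequence of 1-D label vectors, one per Gibbs sweep.
    burn_in: number of initial sweeps to discard (an assumed setting).
    """
    kept = list(label_samples)[burn_in:]
    counts = Counter(len(np.unique(z)) for z in kept)
    total = sum(counts.values())
    return {k: c / total for k, c in sorted(counts.items())}


# Hypothetical toy chain: 5 sweeps over 6 observations.
samples = [
    [0, 0, 1, 1, 2, 3],   # 4 occupied clusters (discarded as burn-in)
    [0, 0, 1, 1, 2, 2],   # 3 occupied clusters
    [0, 0, 1, 1, 2, 2],   # 3 occupied clusters
    [0, 1, 1, 1, 2, 2],   # 3 occupied clusters
    [0, 0, 0, 1, 1, 1],   # 2 occupied clusters
]
post = posterior_of_k(samples, burn_in=1)
k_mode = max(post, key=post.get)   # posterior mode of the number of clusters
print(post, k_mode)
```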

Figure 9: Diabetes data set in the space of the components 1 (glucose area) and 3
(SSPG) and the actual partition (left), the optimal partition obtained by the DPPM
model $\lambda_k D_k A D_k^T$ (middle) and the empirical posterior distribution for the number of
mixture components (right).

4.2.4. Clustering of the Iris data set

The Iris data set is a well-known data set studied by Fisher [52]. It
contains measurements for n = 150 samples of Iris flowers covering three Iris
species (setosa, virginica and versicolor) (K = 3), with 50 samples for each
species. Four features were measured for each sample (d = 4): the length
and the width of the sepals and petals, in centimetres. We applied the PGMM
models and the proposed DPPM models to this data set. For the PGMM
models, the number of clusters K was tested in the range [1; 8].

Table 20 reports the obtained log marginal likelihood values. We can see
that the best solution is the one of the proposed DPPM and corresponds to
the model $\lambda_k D_k A D_k^T$, which has the highest log marginal
likelihood value. One can also see that the other DPPM models provide
partitions with two, three or four clusters and thus do not substantially
overestimate the number of clusters. However, the solution selected by the
PGMM approach corresponds to a partition with four clusters, and some of the
PGMM models overestimate the number of clusters. We also note that the best
partition found by the proposed DPPM, while it contains two clusters, is quite
well defined and has a Rand index of 0.7763.

| Model | DPPM: K | DPPM: log ML | PGMM: K = 1 | K = 2 | K = 3 | K = 4 | K = 5 | K = 6 | K = 7 | K = 8 |
|---|---|---|---|---|---|---|---|---|---|---|
| $\lambda I$ | 4 | -415.68 | -1124.9 | -770.8 | -455.6 | -477.67 | -431.22 | -439.35 | -423.49 | -457.59 |
| $\lambda_k I$ | 3 | -471.99 | -913.47 | -552.2 | -468.21 | -488.01 | -507.8 | -528.8 | -549.62 | -573.14 |
| $\lambda A$ | 3 | -404.87 | -761.44 | -585.53 | -561.65 | -553.41 | -546.97 | -539.91 | -535.37 | -530.96 |
| $\lambda_k A$ | 3 | -432.62 | -765.19 | -623.89 | -643.07 | -666.76 | -688.16 | -709.1 | -736.19 | -762.75 |
| $\lambda D A D^T$ | 4 | -307.31 | -398.85 | -340.89 | -307.77 | -286.96 | -291.7 | -296.56 | -300.37 | -299.69 |
| $\lambda_k D A D^T$ | 2 | -383.72 | -401.61 | -330.55 | -297.50 | -279.15 | -282.83 | -296.24 | -304.37 | -306.81 |
| $\lambda D_k A D_k^T$ | 4 | -576.15 | -1068.2 | -761.71 | -589.91 | -529.52 | -489.9 | -465.37 | -444.84 | -457.86 |
| $\lambda_k D_k A D_k^T$ | 2 | -278.78 | -394.68 | -282.86 | -451.77 | -676.18 | -829.07 | -992.04 | -1227.2 | -1372.8 |

Table 20: Log marginal likelihood values for the Iris data set.

Figure 10: Iris data set in the space of the components 3 (petal length) and 4 (petal width)
(left), the optimal partition obtained by the DPPM model $\lambda_k D_k A D_k^T$ (middle) and the
empirical posterior distribution for the number of mixture components (right).

The evidence of the selected DPPM models, compared to the other ones,
for the four real data sets, is significant. This can be easily seen in the
tables showing the log marginal likelihood values. Consider the comparison
between the selected model and its closest competitor for the four real data
sets. As can be seen in Table 21, which reports the values of 2 log BF of the
best model against the second best one, the evidence of the selected model,
according to Table 2, is strong for the Old Faithful geyser data and decisive
for the Crabs, Diabetes and Iris data. Also, for these latter three data sets,
the model selection by the proposed DPPM is made with greater evidence than
with the PGMM approach.


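| Data set | Old Faithful Geyser | Crabs | Diabetes | Iris |
|---|---|---|---|---|
| DPPM | $\lambda D A D^T$ vs $\lambda_k D_k A D_k^T$ | $\lambda_k D_k A D_k^T$ vs $\lambda_k D A D^T$ | $\lambda_k D_k A D_k^T$ vs $\lambda D_k A D_k^T$ | $\lambda_k D_k A D_k^T$ vs $\lambda D A D^T$ |
| 2 log BF | 5 | 36.08 | 199.58 | 57.06 |
| PGMM | $\lambda_k D A D^T$ vs $\lambda D A D^T$ | $\lambda_k D_k A D_k^T$ vs $\lambda D A D^T$ | $\lambda_k D_k A D_k^T$ vs $\lambda_k D A D^T$ | $\lambda_k D A D^T$ vs $\lambda_k D_k A D_k^T$ |
| 2 log BF | 14.96 | 25.08 | 153.22 | 7.42 |

Table 21: Bayes factor values for the selected model against its closest competitor,
obtained by the PGMM and the proposed DPPM for the real data sets.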
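
The Iris entries of Table 21 can be re-derived from the log marginal
likelihood values of Table 20, since 2 log BF is twice the difference between
the log marginal likelihoods of the two compared models. The following sketch
shows the computation; the ASCII model labels and the helper `two_log_bf` are
hypothetical names introduced for illustration, and for the PGMM each model is
assumed to compete at its own best number of clusters K.

```python
import numpy as np

MODELS = ["lambda*I", "lambda_k*I", "lambda*A", "lambda_k*A",
          "lambda*D*A*D^T", "lambda_k*D*A*D^T",
          "lambda*D_k*A*D_k^T", "lambda_k*D_k*A*D_k^T"]

# Table 20, DPPM column: one log marginal likelihood per model.
dppm_log_ml = np.array([-415.68, -471.99, -404.87, -432.62,
                        -307.31, -383.72, -576.15, -278.78])

# Table 20, PGMM block: rows are models, columns are K = 1..8.
pgmm_log_ml = np.array([
    [-1124.9, -770.8, -455.6, -477.67, -431.22, -439.35, -423.49, -457.59],
    [-913.47, -552.2, -468.21, -488.01, -507.8, -528.8, -549.62, -573.14],
    [-761.44, -585.53, -561.65, -553.41, -546.97, -539.91, -535.37, -530.96],
    [-765.19, -623.89, -643.07, -666.76, -688.16, -709.1, -736.19, -762.75],
    [-398.85, -340.89, -307.77, -286.96, -291.7, -296.56, -300.37, -299.69],
    [-401.61, -330.55, -297.50, -279.15, -282.83, -296.24, -304.37, -306.81],
    [-1068.2, -761.71, -589.91, -529.52, -489.9, -465.37, -444.84, -457.86],
    [-394.68, -282.86, -451.77, -676.18, -829.07, -992.04, -1227.2, -1372.8],
])


def two_log_bf(log_ml_per_model):
    """Return (winner index, runner-up index, 2 log BF of winner vs runner-up)."""
    order = np.argsort(log_ml_per_model)[::-1]     # best model first
    first, second = int(order[0]), int(order[1])
    return first, second, 2.0 * (log_ml_per_model[first] - log_ml_per_model[second])


d1, d2, d_bf = two_log_bf(dppm_log_ml)
pgmm_best = pgmm_log_ml.max(axis=1)                # each PGMM model at its best K
p1, p2, p_bf = two_log_bf(pgmm_best)
best_k = int(pgmm_log_ml[p1].argmax()) + 1         # columns start at K = 1

print(f"DPPM: {MODELS[d1]} vs {MODELS[d2]}, 2 log BF = {d_bf:.2f}")
print(f"PGMM: {MODELS[p1]} (K = {best_k}) vs {MODELS[p2]}, 2 log BF = {p_bf:.2f}")
```

With these inputs the two values are 57.06 for the DPPM and 7.42 for the PGMM,
in agreement with the Iris column of Table 21.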

5. Conclusion 

In this paper we presented Bayesian nonparametric parsimonious mixture
models for clustering. They are based on an infinite Gaussian mixture with
an eigenvalue decomposition of the cluster covariance matrix and a Dirichlet
Process, or by equivalence a Chinese Restaurant Process, prior. This allows
deriving several flexible models and avoids the problem of model selection
encountered in the standard maximum likelihood-based and Bayesian parametric
Gaussian mixtures. We also proposed a Bayesian model selection and
comparison framework to automatically select the best model, with the best
number of components, by using Bayes factors.
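
As a reminder of what the eigenvalue decomposition of a cluster covariance
matrix looks like in practice, the sketch below splits a covariance matrix
into a volume $\lambda$, an orientation matrix $D$ and a shape matrix $A$ such
that $\Sigma = \lambda D A D^T$. It assumes the usual convention in which
$\lambda = |\Sigma|^{1/d}$ and $A$ is diagonal with determinant 1; the function
name `parsimonious_decomposition` and the example matrix are hypothetical.

```python
import numpy as np


def parsimonious_decomposition(sigma):
    """Return (lam, D, A) with sigma = lam * D @ A @ D.T and det(A) = 1."""
    sigma = np.asarray(sigma, dtype=float)
    d = sigma.shape[0]
    eigvals, eigvecs = np.linalg.eigh(sigma)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # reorder to decreasing eigenvalues
    eigvals, D = eigvals[order], eigvecs[:, order]
    lam = np.prod(eigvals) ** (1.0 / d)        # volume: |sigma|^(1/d)
    A = np.diag(eigvals / lam)                 # shape: diagonal, determinant 1
    return lam, D, A


# Hypothetical 2 x 2 covariance matrix (positive definite).
sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])
lam, D, A = parsimonious_decomposition(sigma)
reconstructed = lam * D @ A @ D.T
print(lam, np.allclose(reconstructed, sigma))
```

The parsimonious models differ in which of the three factors are shared across
clusters (for instance $\lambda D A D^T$) and which are cluster-specific (for
instance $\lambda_k D_k A D_k^T$).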

Experiments on simulated data highlighted that the proposed DPPM represent
a good nonparametric alternative to the standard parametric Bayesian
and non-Bayesian finite mixtures. They simultaneously estimate accurate
partitions and infer the optimal number of clusters from the data. We also
applied the proposed approach to real data sets. The obtained results show
the interest of using the Bayesian parsimonious clustering models and the
potential benefit of using them in practical applications. Future work
related to this proposal may concern other parsimonious models, such as those
recently proposed in [53], based on a variance-correlation decomposition of
the group covariance matrices, which are stable and visualizable and have
desirable properties.

Until now we have only considered the problem of clustering. A perspective
of this work is to extend it to the case of model-based co-clustering
[54] with block mixture models, which consists in simultaneously clustering
individuals and variables, rather than only individuals. The nonparametric
formulation of these models may represent a good alternative for selecting the
number of latent blocks or co-clusters.